On Aligning Massive Time-Series Data in Splash

Authors

  • Peter J. Haas
  • Yannis Sismanis
Abstract

Important emerging sources of big data are large-scale predictive simulation models used in e-science and, increasingly, in guiding policy and investment decisions around highly complex issues such as population health and safety. The Splash project provides a platform for combining existing heterogeneous simulation models and datasets across a broad range of disciplines to capture the behavior of complex systems of systems. Splash loosely couples models via data exchange, where each submodel often produces or expects time series having huge numbers of time points and many data values per time point. If the time-series output of one “source” submodel is used as input for another “target” submodel and the time granularity of the source is coarser than that of the target, an interpolation operation is required. Cubic-spline interpolation is the most widely used method because of its smoothness properties. Scalable methods are needed for such data transformations, because the amount of data produced by a simulation program can be massive when simulating large, complex systems over long time periods, especially when the time dimension is modeled at high resolution. We demonstrate that we can efficiently perform cubic-spline interpolation over a massive time series in a MapReduce environment using novel algorithms based on adapting the distributed stochastic gradient descent (DSGD) method of Gemulla et al., originally developed for low-rank matrix factorization. Specifically, we adapt DSGD to calculate the coefficients that appear in the cubic-spline interpolation formula by solving a massive tridiagonal system of linear equations. Our techniques are potentially applicable to both spline interpolation and parallel solution of diagonal linear systems in other massively parallel data-integration and data-analysis applications.
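To make the tridiagonal system mentioned in the abstract concrete, here is a minimal single-machine sketch of natural cubic-spline interpolation. It is not the paper's DSGD/MapReduce algorithm: it solves the same kind of tridiagonal system sequentially with the classic Thomas algorithm, and it assumes natural boundary conditions (zero second derivative at both ends), which the abstract does not specify. The function names `natural_spline_second_derivs` and `spline_eval` are hypothetical, chosen only for illustration.

```python
import bisect


def natural_spline_second_derivs(t, y):
    """Return the second derivatives s[i] of a natural cubic spline at knots t.

    Assumption: natural boundary conditions, so s[0] = s[-1] = 0.
    The interior values solve a tridiagonal system, handled here with the
    sequential Thomas algorithm (a stand-in for a distributed solver).
    """
    n = len(t)
    if n < 3:
        return [0.0] * n
    h = [t[i + 1] - t[i] for i in range(n - 1)]  # knot spacings
    m = n - 2  # number of interior unknowns
    # Row for interior knot i (1 <= i <= n-2):
    # h[i-1]*s[i-1] + 2*(h[i-1]+h[i])*s[i] + h[i]*s[i+1] = rhs[i]
    sub = [h[i - 1] for i in range(1, n - 1)]
    diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n - 1)]
    sup = [h[i] for i in range(1, n - 1)]
    rhs = [
        6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
        for i in range(1, n - 1)
    ]
    # Forward elimination.
    for k in range(1, m):
        w = sub[k] / diag[k - 1]
        diag[k] -= w * sup[k - 1]
        rhs[k] -= w * rhs[k - 1]
    # Back substitution.
    s = [0.0] * m
    s[m - 1] = rhs[m - 1] / diag[m - 1]
    for k in range(m - 2, -1, -1):
        s[k] = (rhs[k] - sup[k] * s[k + 1]) / diag[k]
    return [0.0] + s + [0.0]


def spline_eval(t, y, s, x):
    """Evaluate the spline at x, given knots t, values y, second derivatives s."""
    # Locate the interval [t[i], t[i+1]] containing x (clamped to the ends).
    i = min(max(bisect.bisect_right(t, x) - 1, 0), len(t) - 2)
    h = t[i + 1] - t[i]
    a = (t[i + 1] - x) / h
    b = (x - t[i]) / h
    return (a * y[i] + b * y[i + 1]
            + ((a ** 3 - a) * s[i] + (b ** 3 - b) * s[i + 1]) * h * h / 6.0)


# Usage: upsample a coarse "source" series onto a finer "target" time grid.
coarse_t = [0.0, 1.0, 2.0, 3.0]
coarse_y = [0.0, 1.0, 8.0, 27.0]
sigma = natural_spline_second_derivs(coarse_t, coarse_y)
fine_t = [k * 0.25 for k in range(13)]
fine_y = [spline_eval(coarse_t, coarse_y, sigma, x) for x in fine_t]
```

The Thomas algorithm is inherently sequential, since each elimination step depends on the previous row. That dependency is one reason a massive time series would call for a different, parallelizable solver of the kind the abstract describes.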


Similar resources

Spatiotemporal analysis of remotely sensed Landsat time series data for monitoring 32 years of urbanization

The world is witnessing a dramatic shift of settlement pattern from rural to urban population, particularly in developing countries. The rapid Addis Ababa urbanization reflects this global phenomenon and the subsequent socio-economic and environmental impacts, are causing massive public uproar and political instability. The objective of this study was to use remotely sensed Landsat data to iden...


Searching Genetic Databases on Splash 2

Hoang’s paper demonstrated the absolutely massive speedup potential present in FPGA acceleration for DNA sequence matching, showing that a single Splash 2 board (consisting of 17 Xilinx 4010 FPGAs) should be two orders of magnitude faster than a MasPar MP-1, a SIMD supercomputer uniquely suited to this problem. A full 16-board Splash 2 configuration would be sixteen times faster. Comparisons wi...


On the Detection of Trends in Time Series of Functional Data

A sequence of functions (curves) collected over time is called a functional time series. Functional time series analysis is one of the popular research areas in which statistics from such data are frequently observed. The main purpose of the functional time series is to predict and describe random mechanisms that resulted in generating the data. To do so, it is needed to decompose functional ti...


Modeling of sulfur dioxide emissions in Ahvaz City, southwest of Iran during 2013

Sulfur dioxide has two important sources in the atmosphere and this is why most of scientists believe in a geographic split in the globe. Power plants, major emitter of SO2, are located in north hemisphere such as in Russia, China, Canada and the USA. In south hemisphere, phytoplankton produces a massive amount of dimethyl sulfide (DMS) and dimethyl disulfide (DMDS). Then these types of reduced...


Fitting of Count Time Series Models on the Number of Patients Referred to Addiction Treatment Centers in Semnan County

Abstract. Count data over time are observed in many application areas. Many researchers use time series patterns to analyze this data. In this paper, the poisson count time series linear models and negative binomials on this type of data with the explanatory variables are studied. The Likelihood analysis and the evaluation of count time series model based on generalized linear models are pres...




Publication date: 2012